Abstract - Speaker verification systems are basically composed
of three stages: feature extraction, feature processing, and
comparison of the processed features from the reference speaker’s
voice with those from the voice to be verified. Many features have
been used in the first stage, although the current literature has
not yet shown which of them is best. Based on the biometrics
hypothesis, which states that each individual has physical
characteristics that distinguish him or her from others, this paper
compares 12 classical, widely used parameters in order to
investigate that hypothesis. The results indicate that the
parameters directly correlated with the speaker’s anatomy are among
the best ones that can be used in the development of speaker
verification systems.
I.  INTRODUCTION

Physical characteristics of subjects have been used in
several scientific works concerning subject verification,
with uses ranging from security to forensic applications [1].
The use of such characteristics defines the so-called
biometrics technique, which assumes that each subject exhibits
individual patterns that distinguish him or her from others.
It has been reported that this approach can present some
advantages when compared with other classical features, and
that it is generally more reliable and secure [2].

The aim of most speaker recognition/verification studies
is to develop a free-time, unbiased, fast, free-text system
that presents the same accuracy as a human being in speaker
recognition. The motivation for such studies is to produce
more robust and reliable systems that can be used in financial
security, psychological evaluation [3], and vocal tract
evaluation [4], and that can be accepted in the forensic area.

Speaker verification systems use many methods, such as
neural networks [5, 6, 7], Gaussian mixture models [8, 9], data
fusion [10], the Prony technique [11], cohort models [12], and
orthogonal linear prediction [13], among others, to perform
the comparison/classification of features, which normally
belong to the set LPC, PARCOR, AR, Area Function, Log
Area, etc.

The aim of this paper is to perform an objective comparison
between a set of commonly used speech-derived parameters, to
verify whether the biometrics hypothesis holds and whether the
parameters directly correlated with the anatomic aspects (Fig. 1)
of the speaker really perform well in the design of a
speaker recognition/verification system. The motivation for
this work is the disagreement in the literature concerning the
accuracy and robustness of several parameters. Imperl [11] and
Kishore [5], for example, used cepstral parameters, Furui [16]
used log area ratios, and Sambur [13] used LPC parameters, and
all of them obtained good results. For this reason, at the start
of a speaker verification design it is reasonable to investigate
which speech-dependent features are best.


Fig. 1. Magnetic resonance image of the vocal tract.


The results of the comparison among twelve classically used
parameters are presented, using the same preprocessing and
comparison stages in the rest of the system, namely Sambur’s
technique [13]. These results confirm the hypothesis that
parameters linked to the biometrics approach are among the best
of the studied features, suggesting their adoption in the design
of speaker verification systems intended for wide use, including
forensic applications [1].


II.  METHODOLOGY 

A.  Signal  processing 

Since the system designed by Sambur [13] is used as the
fixed part in the comparison, it is important to summarize
that approach.

The several utterances ($l = 1, 2, \ldots, L$) of the speech signal
from the $m$-th speaker that compose the reference base of the
verification system are initially divided into $J_{lm}$ fixed-size
frames and stored in $L$ matrices, where each column
corresponds to a signal frame. From each column of these
matrices, $p$ coefficients are extracted using the desired
parameter extraction technique. This procedure results in
$L$ new matrices $A_{lm}$.

From the parameter matrices $A_{lm}$, covariance matrices $R_{lm}$
are calculated. For each of the $m$ speakers that the system is
trained to verify, the so-called reference covariance matrix
($R_{ref}$) is calculated as the weighted average of the
$R_{lm}$ matrices, the weights being the number of frames in the
respective matrices (1). This procedure reduces the random
estimation error [13].

$$R_{ref} = \frac{\sum_{l=1}^{L} J_{lm} R_{lm}}{\sum_{l=1}^{L} J_{lm}} \qquad (1)$$

where $J_{lm}$ is the number of frames in the $l$-th utterance for the
$m$-th speaker.
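The weighted average in (1) can be illustrated with a short Python sketch. It is only a minimal illustration, not the routine used in this work: the function names `covariance` and `reference_covariance` are hypothetical, each utterance is assumed to be stored as a list of frames (one list of $p$ coefficients per frame), and the covariance is assumed to be normalized by the number of frames.

```python
def covariance(frames):
    """Covariance matrix of one utterance (assumed normalization: 1/J)."""
    j = len(frames)
    p = len(frames[0])
    mean = [sum(f[k] for f in frames) / j for k in range(p)]
    return [[sum((f[a] - mean[a]) * (f[b] - mean[b]) for f in frames) / j
             for b in range(p)]
            for a in range(p)]


def reference_covariance(utterances):
    """Average of the per-utterance covariances, weighted by frame count."""
    total_frames = sum(len(frames) for frames in utterances)
    p = len(utterances[0][0])
    r_ref = [[0.0] * p for _ in range(p)]
    for frames in utterances:
        r_lm = covariance(frames)
        weight = len(frames) / total_frames
        for a in range(p):
            for b in range(p):
                r_ref[a][b] += weight * r_lm[a][b]
    return r_ref


# Toy data: two utterances with 2 and 4 frames of p = 2 coefficients each.
toy = [
    [[0.0, 0.0], [2.0, 2.0]],
    [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],
]
r_ref = reference_covariance(toy)
```

Because longer utterances contribute more frames, they receive a proportionally larger weight in the reference matrix.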

Given the reference covariance matrix, the statistical
variance (eigenvalue $\lambda_i$) of each orthogonal parameter is
first found by solving the set of simultaneous equations

$$|R_{ref} - \lambda_i I| = 0. \qquad (2)$$

The mutually orthogonal eigenvectors ($b_i$) are then derived
as solutions of the equation

$$\lambda_i b_i = R_{ref} b_i, \quad i = 1, 2, \ldots, p. \qquad (3)$$

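The eigenvectors ($b_i$) associated with the reference covariance
matrix form a conversion matrix. The $i$-th orthogonal
parameter $\phi_{ijlm}$ in the $j$-th frame of the $A_{lm}$ matrix, for the
$m$-th speaker, is obtained from the product of the conversion matrix
and the parameter matrix, as shown in (4). The average
value of the $i$-th orthogonal parameter for the $m$-th speaker is
given by (5).

$$\phi_{ijlm} = b_i^{T} a_{jlm} \qquad (4)$$

$$\bar{\phi}_{im} = \frac{1}{\sum_{l=1}^{L} J_{lm}} \sum_{l=1}^{L} \sum_{j=1}^{J_{lm}} \phi_{ijlm} \qquad (5)$$

where $a_{jlm}$ denotes the $j$-th column of $A_{lm}$.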
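The eigenvalue problem and the projection onto the eigenvectors can be sketched in Python as follows. This is a hedged illustration rather than the implementation used in this work: the text does not state how the eigenvalue problem was solved, so a classical Jacobi rotation solver is substituted here to keep the example free of dependencies. The names `jacobi_eigen` and `orthogonal_parameters` are hypothetical, and sorting the eigenvalues in decreasing order is an assumption.

```python
import math


def jacobi_eigen(matrix, max_sweeps=50, tol=1e-18):
    """Eigenvalues/eigenvectors of a symmetric matrix by Jacobi rotations.

    Returns (values, vectors); vectors[k][i] is component k of eigenvector i.
    """
    n = len(matrix)
    a = [list(row) for row in matrix]
    v = [[1.0 if row == col else 0.0 for col in range(n)] for row in range(n)]
    for _ in range(max_sweeps):
        off = sum(a[row][col] ** 2
                  for row in range(n) for col in range(n) if row != col)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the off-diagonal element a[p][q].
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of a and v
                    a[k][p], a[k][q] = (c * a[k][p] - s * a[k][q],
                                        s * a[k][p] + c * a[k][q])
                    v[k][p], v[k][q] = (c * v[k][p] - s * v[k][q],
                                        s * v[k][p] + c * v[k][q])
                for k in range(n):  # rotate rows p and q of a
                    a[p][k], a[q][k] = (c * a[p][k] - s * a[q][k],
                                        s * a[p][k] + c * a[q][k])
    return [a[i][i] for i in range(n)], v


def orthogonal_parameters(r_ref, frames):
    """Project each frame onto the eigenvectors of r_ref.

    Returns (eigenvalues, phi, means): phi[j][i] is the i-th orthogonal
    parameter of frame j, and means[i] is its average over the frames.
    """
    values, vectors = jacobi_eigen(r_ref)
    p = len(r_ref)
    order = sorted(range(p), key=lambda i: values[i], reverse=True)
    eigenvalues = [values[i] for i in order]
    basis = [[vectors[k][i] for k in range(p)] for i in order]  # rows are b_i
    phi = [[sum(b[k] * x[k] for k in range(p)) for b in basis] for x in frames]
    means = [sum(row[i] for row in phi) / len(phi) for i in range(p)]
    return eigenvalues, phi, means


# Toy example with p = 2: eigenvalues of [[2, 1], [1, 2]] are 3 and 1.
eigenvalues, phi, means = orthogonal_parameters(
    [[2.0, 1.0], [1.0, 2.0]],
    [[1.0, 1.0], [1.0, -1.0]],
)
```

The sign of each eigenvector is arbitrary, so the projected values are defined only up to a sign per orthogonal parameter.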

In the verification process, the unknown speaker’s voice is
initially processed in the same way. The process is based
on the dissimilarity between the unknown speaker’s orthogonal
parameter set and the analogous set for each of the $m$ speakers
the system is able to verify. This dissimilarity is measured by
the distance between the two sets of parameters, calculated as


where $Z_i$ is the mean value of the $i$-th orthogonal parameter
calculated across the utterance of the unknown speaker by
(3); $\lambda_{im}$ is the reference eigenvalue for the $i$-th orthogonal
parameter of the $m$-th speaker; $(p - 1)$ is the number of the
first most important orthogonal parameters not included in the
summation; and $\bar{J}_m$ is the average number of frames in the
utterances of the $m$-th speaker’s design set, calculated as

$$\bar{J}_m = \frac{1}{L} \sum_{l=1}^{L} J_{lm}. \qquad (7)$$

Ordinarily, the speaker verification process adopts a decision
threshold, below which the unknown speaker is accepted as the
speaker being verified. This process is illustrated in Fig. 2.


Fig. 2. Schematic of a generic speaker verification process using
Sambur’s method [13].


The above version of the speaker verification system was used
in the comparison of the twelve studied speech-dependent
parameters, which were the only part that changed in each
developed system.

B.  Speech signal acquisition and preprocessing

The signals used in this work were acquired with a Creative
Labs sound board (Sound Blaster model CT4500), at a
44.1 kHz sampling rate, 16 bits, mono. Each utterance was
recorded in a .wav file. Silence segments were manually
detected and removed with the Creative Studio software
(Creative Labs, version 3.20.0, 1996). These files were
lowpass filtered at 4 kHz (4th-order Butterworth), producing the
signals submitted to the verification system, which were
segmented into 25 ms frames using a Hann window.
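The segmentation step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the routine used in this work: the function names are hypothetical, frames are assumed not to overlap (the text does not give a frame step), the frame length is rounded to a whole number of samples, and a trailing partial frame is discarded.

```python
import math


def hann_window(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def split_into_frames(signal, sample_rate=44100, frame_ms=25.0):
    """Cut a signal into consecutive Hann-windowed frames of frame_ms each."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    window = hann_window(frame_len)
    n_frames = len(signal) // frame_len  # trailing partial frame is dropped
    return [
        [signal[f * frame_len + k] * window[k] for k in range(frame_len)]
        for f in range(n_frames)
    ]


# Toy usage: a constant signal at 8 kHz gives 200-sample frames.
frames = split_into_frames([1.0] * 450, sample_rate=8000, frame_ms=25.0)
```

Each inner list then plays the role of one column of the frame matrix described in Section II-A.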

The parameter comparisons were performed with a
polysyllabic word containing a wide range of characteristic
phonemes of Portuguese, the word “bioinstrumentação”
(bioinstrumentation), which was repeated five times by each of
the five volunteer speakers. Four of the five utterances were
used to derive the reference covariance matrix and the last
utterance was used as the unknown signal.

C.  Features 

Several computational routines [14] were investigated for
the parameter extraction procedure. The LPC parameter
(twelve coefficients) was adopted as the primary parameter,
from which the other 11 parameters were calculated
through linear/nonlinear conversions using the Voicebox
toolbox [15] and routines specially developed in
Matlab 5.2. The eleven studied parameters were:

•  Autocorrelation  Coefficients  (AC) 

•  Area  Coefficients  (AF) 

•  Area  Ratios  (AO) 

•  Complex  Cepstral  Coefficients  (CC) 

•  Formant Frequencies (FF)

•  Log Area Coefficients (LA)

•  Log Area Ratios (LO)

•  Line  Spectrum  Pairs  (LSP) 

•  Autocorrelation Coefficients of the inverse filter’s
impulse response (RA)

•  Reflection  Coefficients  (RF) 

•  Z-plane  autoregressive  poles  (ZZ) 
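As an illustration of such conversions, the Python sketch below derives reflection coefficients from LPC coefficients with the standard step-down recursion and then computes log area ratios. It is only a sketch and not the Voicebox or Matlab code used in this work: the function names are hypothetical, the predictor polynomial is assumed to be written as A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, and the sign convention of the log area ratio is an assumption, since conventions differ between references.

```python
import math


def lpc_to_reflection(lpc):
    """Step-down recursion: lpc = [1, a_1, ..., a_p] -> [k_1, ..., k_p]."""
    a = list(lpc)
    p = len(a) - 1
    k = [0.0] * p
    for m in range(p, 0, -1):
        km = a[m]
        if abs(km) >= 1.0:
            raise ValueError("unstable predictor: |k| >= 1")
        k[m - 1] = km
        # Order m-1 polynomial obtained from the order m polynomial.
        a = [1.0] + [(a[i] - km * a[m - i]) / (1.0 - km * km)
                     for i in range(1, m)]
    return k


def log_area_ratios(reflection):
    """Log area ratios, assuming the convention log((1 - k) / (1 + k))."""
    return [math.log((1.0 - k) / (1.0 + k)) for k in reflection]


# Toy order-2 predictor built from k_1 = 0.5 and k_2 = -0.3.
k = lpc_to_reflection([1.0, 0.35, -0.3])
lar = log_area_ratios(k)
```

The same reflection coefficients are the usual starting point for area-based parameters, which is why a single set of LPC coefficients can feed several of the features listed above.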

For each speech-dependent parameter a performance table
was obtained, in which each row corresponds to a known speaker
and each column to an unknown speaker.


III.  RESULTS  AND  DISCUSSION 

The results for the twelve studied parameters were organized
in tables like Table I, which shows the performance for the
LSP parameter.


Table  I

PERFORMANCE TABLE FOR LSP PARAMETER

Known       Unknown speaker
speaker     1         2         3         4         5
1           10.2      192.3     39.4      162.2     126.0
2           54.6      5.4       1361.5    61.5      247.7
3           72.2      48.7      13.7      57.1      161.6
4           29.9      37.0      60.5      26.4      237.5
5           73.3      63.0      130.8     124.4     2.7

The system performance can be evaluated by the difference
between the values on the principal diagonal (corresponding to a
correct verification) and the values off the principal
diagonal (corresponding to a wrong verification). If the
system is efficient, each element of the principal diagonal
is the minimum value of its row and column. The systems can be
compared with one another through the ratio between the smallest
value off the principal diagonal (OutD) and the value on the
principal diagonal (InD), for the worst case, which is
represented by the smaller value in the principal diagonal. The
parameter with the best performance in the speaker verification
system is the one that exhibits the greatest OutD/InD ratio.
Since the preprocessing and comparison techniques were the same
in all cases, this ratio identifies the best parameter.
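The evaluation can be sketched in Python using the values of Table I. The sketch relies on an assumption about the procedure: for each known speaker, OutD is taken as the smallest off-diagonal value in that speaker's row and column, and the worst case is taken as the speaker with the smallest OutD/InD ratio. The function names are hypothetical. Under this reading, the LSP values reproduce the LSP entry of Table II to the precision available in Table I.

```python
# Distances from Table I (LSP): rows = known speaker, columns = unknown speaker.
TABLE_I = [
    [10.2, 192.3, 39.4, 162.2, 126.0],
    [54.6, 5.4, 1361.5, 61.5, 247.7],
    [72.2, 48.7, 13.7, 57.1, 161.6],
    [29.9, 37.0, 60.5, 26.4, 237.5],
    [73.3, 63.0, 130.8, 124.4, 2.7],
]


def off_diagonal_values(table, i):
    """Off-diagonal values in row i and column i of a square table."""
    n = len(table)
    row = [table[i][j] for j in range(n) if j != i]
    col = [table[j][i] for j in range(n) if j != i]
    return row + col


def all_verified(table):
    """True if every diagonal value is the minimum of its row and column."""
    return all(table[i][i] < min(off_diagonal_values(table, i))
               for i in range(len(table)))


def worst_case_ratio(table):
    """Return (OutD/InD, OutD, InD) for the speaker with the smallest ratio."""
    worst = None
    for i in range(len(table)):
        out_d = min(off_diagonal_values(table, i))
        in_d = table[i][i]
        if worst is None or out_d / in_d < worst[0]:
            worst = (out_d / in_d, out_d, in_d)
    return worst


ratio, out_d, in_d = worst_case_ratio(TABLE_I)
print(all_verified(TABLE_I), out_d, in_d, round(ratio, 2))
```

A ratio close to 1 means that at least one impostor distance is almost as small as the distance of the true speaker, so a larger ratio leaves more room for the decision threshold.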

Table II presents the results for the parameters that produced
correct verifications for all the unknown speakers. Some
parameters (formants, magnitudes of z-poles, inverse filter
coefficients, area ratios, log area ratios) presented
identification errors, and for these parameters no
performance comparison was carried out.


Table  II

PERFORMANCE COMPARISON BETWEEN FEATURES

            Performance values/ratio
Feature     OutD      InD       OutD/InD
RA          27.54     9.87      2.79
RF          36.87     18.38     2.00
AF          36.16     19.39     1.86
LA          45.05     28.97     1.55
CC          36.29     24.80     1.46
LPC         48.64     36.16     1.34
LSP         29.97     26.43     1.13

IV.  CONCLUSION

Although the literature contains a great number of works
using cepstral parameters [5, 8, 11, 16, 17], the
comparison carried out in the present paper indicates that the
parameters correlated with the biometric characteristics of
the speaker are among the best options for the design of
speaker verification systems. Thus, the area function
parameter seems to be a good choice, although it gave only the
third best result. Only a more general study, in terms of the
number of speakers and different utterances, can really confirm
whether it is worse than the autocorrelation coefficients of the
inverse filter’s impulse response or the reflection
coefficients. It must be mentioned that, although all
eleven parameters were derived from the LPC parameters, the
linear/nonlinear conversion can lead to a parameter more
suitable for the speaker verification system than the
primary LPC parameter.

Despite the good performance observed for the verification
system shown in Table II when used with Sambur’s
technique [13], it could not be used in forensic applications
[1]. For such cases, a probability that the unknown speaker is
one of the true speakers is desired instead of a distance
value. One can therefore conclude that the research must continue
in order to develop a more general system that could be more
widely applied.